Polynomial Regression

 

Example: QUADRATIC MODEL FOR PREDICTING THE US POPULATION

 

> setwd("C:\Users\baron\624 Data Analysis\data")

> read.csv("USpop.csv")

   Year Population

1  1790        3.9

2  1800        5.3

3  1810        7.2

4  1820        9.6

5  1830       12.9

6  1840       17.1

7  1850       23.2

8  1860       31.4

9  1870       38.6

10 1880       50.2

11 1890       63.0

12 1900       76.2

13 1910       92.2

14 1920      106.0

15 1930      123.2

16 1940      132.2

17 1950      151.3

18 1960      179.3

19 1970      203.3

20 1980      226.5

21 1990      248.7

22 2000      281.4

23 2010      308.7

 

> Data = read.csv("USpop.csv")

> names(Data)

[1] "Year"       "Population"

 

> attach(Data)

> plot(Year,Population)

 

 

> # LINEAR MODEL

 

> lin = lm(Population ~ Year)

> summary(lin)

 

Call:

lm(formula = Population ~ Year)

 

Residuals:

    Min      1Q  Median      3Q     Max

-27.774 -24.872  -6.295  18.374  55.087

 

Coefficients:

              Estimate Std. Error t value

(Intercept) -2.481e+03  1.672e+02  -14.84

Year         1.360e+00  8.794e-02   15.47

            Pr(>|t|)   

(Intercept) 1.33e-12 ***

Year        5.93e-13 ***

---

Signif. codes: 

  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’

  0.1 ‘ ’ 1

 

Residual standard error: 27.97 on 21 degrees of freedom

Multiple R-squared:  0.9193,    Adjusted R-squared:  0.9155

F-statistic: 239.3 on 1 and 21 DF,  p-value: 5.927e-13

 

> abline(lin,col="red",lwd=3)

 

Ø  # Clearly, the linear model is too inflexible and restrictive, it does not provide a good fit. This is underfitting. Notice, however, that R2 in this regression is 0.9193. Without looking at the plot, we could have assumed that the model is very good!

 

> predict(lin,data.frame(Year=2020))

       1

267.2166

 

Ø  # This is obviously a poor prediction. The US population was already 308 million during the most recent Census.

Ø  So, we probably omitted an important predictor. Residual plots will help us determine which one.

Ø  Let’s produce various related plots.

Ø  Partition the graphics window into 4 parts and use “plot”.

 

> par(mfrow=c(2,2))

> plot(lin)

 

Ø  # The first plot shows that a quadratic term has been omitted although it is important in the population growth.

 

Ø  # So, fit a quadratic model.

 

Ø  Command I(…) means “inhibit interpretation”, it forces R to understand (…) literally, as Year squared.

 

> quadr = lm(Population ~ Year + I(Year^2))

> par(mfrow=c(1,1))

> plot(Year,Population)

 

Ø  # Now let’s plot the fitted curve. Obtain fitted values.

 

> Yhat = fitted.values(quadr)

> Yhat

         1          2          3          4

  6.490870   5.870040   6.603913   8.692490

         5          6          7          8

 12.135771  16.933755  23.086443  30.593834

         9         10         11         12

 39.455929  49.672727  61.244229  74.170435

        13         14         15         16

 88.451344 104.086957 121.077273 139.422292

        17         18         19         20

159.122016 180.176443 202.585573 226.349407

        21         22         23

251.467945 277.941186 305.769130

 

> lines(Year,Yhat,col="blue",lwd=3)

 

# This looks like a very good fit! Only during WWII, the population grew at a rate slightly below the quadratic trend, although it quickly caught up with it during the Baby Boom.

> summary(quadr)

 

Call:

lm(formula = Population ~ Year + I(Year^2))

 

Residuals:

    Min      1Q  Median      3Q     Max

-7.8220 -0.7130  0.5961  1.8344  3.7487

 

Coefficients:

              Estimate Std. Error t value

(Intercept)  2.194e+04  5.795e+02   37.87

Year        -2.438e+01  6.105e-01  -39.94

I(Year^2)    6.774e-03  1.606e-04   42.17

            Pr(>|t|)   

(Intercept)   <2e-16 ***

Year          <2e-16 ***

I(Year^2)     <2e-16 ***

---

Signif. codes: 

  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’

  0.1 ‘ ’ 1

 

Residual standard error: 3.023 on 20 degrees of freedom

Multiple R-squared:  0.9991,    Adjusted R-squared:  0.999

F-statistic: 1.113e+04 on 2 and 20 DF,  p-value: < 2.2e-16

 

Ø  A higher R-squared is not surprising. It will always increase when we add new variables to the model. The fair criterion is Adjusted R-squared, when we compare models with a different number of parameters. Quadratic model has Adjusted R-squared = 0.999 comparing with 0.9155 for the linear model.

 

> predict(quadr,data.frame(Year=2020))

       1

334.9518

 

Ø  Now, this is a reasonable prediction for the year 2020.

 

> predict(quadr,data.frame(Year=2020),interval="confidence")

       fit      lwr      upr

1 334.9518 330.6373 339.2663

> predict(quadr,data.frame(Year=2020),interval="prediction")

       fit     lwr      upr

1 334.9518 327.311 342.5925

 

Ø  Are the confidence and predictions intervals valid here? We’ll discuss regression assumptions next time.